Computational Statistics II

Unit A.1: Metropolis-Hastings and Gibbs sampling

Tommaso Rigon

University of Milano-Bicocca

Unit A.1

Topics

  • Markov Chain Monte Carlo (MCMC)
  • The Metropolis–Hastings algorithm
  • The Gibbs sampling algorithm
  • Writing clean and efficient R code

Main references

  • Robert, C. P., and Casella, G. (2004). Monte Carlo Statistical Methods. Springer.
  • Roberts, G. O., and Rosenthal, J. S. (2004). General state space Markov chains and MCMC algorithms. Probability Surveys, 1(1), 20–71.
  • Tierney, L. (1994). Markov chains for exploring posterior distributions. Annals of Statistics, 22(4), 1701-176.

Associated R code is available on the website of the course.

Bayesian computations

  • Over the past 30 years, Markov Chain Monte Carlo methods (MCMC) methods have revolutionized Bayesian statistics.

  • Bayesian computational statistics is nowadays a lively and mature research field, compared to the early days. Still, there are several open questions.

“Arguably the biggest challenge is in computation. If MCMC is no longer viable for the problems people want to address, then what is the role of INLA, of variational methods, of ABC approaches?” Alan Gelfand (ISBA bullettin, 2011)

Bayesian inference (recap)

  • Let \bm{X} be the data, following some distribution \pi(\textbf{X} \mid \boldsymbol{\theta}), i.e. the likelihood, with \boldsymbol{\theta} \in \Theta \subseteq \mathbb{R}^p being an unknown set of parameters.

  • Let \pi(\boldsymbol{\theta}) be the prior distribution associated to \boldsymbol{\theta}.

  • In Bayesian analysis, inference is based on the posterior distribution for \boldsymbol{\theta}, defined as \pi(\boldsymbol{\theta} \mid \textbf{X}) = \frac{\pi(\boldsymbol{\theta}) \pi(\textbf{X} \mid \boldsymbol{\theta})}{\int_\Theta\pi(\boldsymbol{\theta}) \pi(\textbf{X} \mid \boldsymbol{\theta}) \mathrm{d} \boldsymbol{\theta}}.

  • Key issue. The normalizing constant, i.e. the above integral, is often intractable \implies no analytical solutions, beyond conjugate cases.

  • Numerical approximations of \int_\Theta\pi(\boldsymbol{\theta}) \pi(\textbf{X} \mid \boldsymbol{\theta}) \mathrm{d} \boldsymbol{\theta} are highly unstable, especially in high dimensions \implies the integrate R function will not work in most cases.

Bayesian inference using sampling

  • It is sometimes possible to sample from the posterior distribution even without knowing the normalizing constant.

  • If we can get random samples \boldsymbol{\theta}^{(1)}, \dots,\boldsymbol{\theta}^{(R)} from the posterior distribution, then we can approximate any functional of interest, i.e. \mathbb{E}(g(\boldsymbol{\theta}) \mid \textbf{X}) \approx \frac{1}{R}\sum_{r=1}^R g(\boldsymbol{\theta}^{(r)}).

  • If \boldsymbol{\theta}^{(1)}, \dots,\boldsymbol{\theta}^{(R)} were independent samples from the posterior distribution, this approximation would be called Monte Carlo integration.

  • Monte Carlo integration is justified by the law of large numbers.

  • In this course, we will consider samples \boldsymbol{\theta}^{(1)}, \dots,\boldsymbol{\theta}^{(R)} that are dependent and follow a Markov Chain \implies Markov Chain Monte Carlo (MCMC).

Markov chains

  • A sequence \textbf{Y}^{(0)}, \textbf{Y}^{(1)}, \dots, \textbf{Y}^{(R)} of random elements is a Markov chain if \mathbb{P}(\textbf{Y}^{(r + 1)} \in A \mid \textbf{y}^{(0)}, \dots, \textbf{y}^{(r)}) = \mathbb{P}(\textbf{Y}^{(r + 1)} \in A \mid\textbf{y}^{(r)}).

  • In other words, the conditional distribution of \textbf{Y}^{(r + 1)} given \textbf{y}^{(0)}, \dots, \textbf{y}^{(r)} is the same as the conditional distribution of \textbf{Y}^{(r + 1)} given \textbf{y}^{(r)}, called transition kernel.

  • Given an initial condition \textbf{y}^{(0)} a Markov chain is fully characterized by its transition kernel, which we assume does not depend on r (homogeneity).

  • In continuous cases, the transition kernel is identified by a conditional density, denoted with k(\textbf{y}^{(r+1)} \mid \textbf{y}^{(r)}).

  • When the sample space is finite, the transition kernel is a transition matrix, say P.

Example: AR(1)

  • Autoregressive processes provide a simple illustration of Markov chains on continuous state-space.

  • Let Y^{(0)} \sim N(30, 1) and let us define Y^{(r)} = \rho Y^{(r-1)} + \epsilon^{(r)}, \qquad \rho \in \mathbb{R}, with the error terms \epsilon^{(r)} being iid according to a standard Gaussian \text{N}(0,1).

  • Then the sequence of Y^{(r)} forms indeed a Markov chain and the transition density is such that (y^{(r)} \mid y^{(r-1)}) \sim \text{N}(\rho y^{(r-1)}, 1).

  • If the parameter |\rho| < 1 then the Markov chain has a more “stable” behaviour.

Example: AR(1)

Invariant distribution

  • An increased level of stability of a Markov chain occurs when the latter admits an invariant or stationary probability distribution.

  • A probability density \pi(\textbf{y}) is invariant for a Markov chain with kernel k if \pi(\textbf{y}^*) = \int k(\textbf{y}^* \mid \textbf{y})\pi(\textbf{y})\mathrm{d}\textbf{y}.

  • This is to say that the marginal distributions of \textbf{Y}^{(r)} and \textbf{Y}^{(r + 1)} are the same and are equal to \pi(\textbf{y}), although \textbf{Y}^{(r)} and \textbf{Y}^{(r + 1)} remain dependent.

  • Roughly speaking, if a Markov chain admits a stationary distribution + some technical conditions, then for R large enough the chain “stabilizes” around the invariant law.

  • In the previous \textup{ar}(1) example the stationary distribution is \text{N}(0, 1 / (1 - \rho^2)).

Invariant distribution

  • Not every Markov chain admits a stationary law. However, the Markov chains constructed for sampling purposes should always converge to an invariant distribution.

  • Indeed, in Markov Chain Monte Carlo the stationary distribution \pi(\textbf{y}) represents the target density from which we wish to simulate.

  • Then, we will make use of the following approximation \int g(\textbf{y}) \pi(\textbf{y})\mathrm{d}\textbf{y} \approx \frac{1}{R}\sum_{r=1}^R g(\textbf{y}^{(r)}), where \textbf{y}^{(1)}, \dots, \textbf{y}^{(R)} are generated according to a Markov chain, with \textbf{y}^{(0)} \sim \pi(\textbf{y}).

  • How to construct a Markov chain that converges to the desired density \pi(\textbf{y})?

  • Before delving into this key problem, let us briefly review the assumptions under which this approximation is a reasonable one.

Regularity conditions

  • We will consider Markov chains that are irreducible, aperiodic, and Harris recurrent.

  • A rigorous presen discrete case to help the intuition.

  • For a more detailed treatment, see Chapter 6 of Robert and Casella (2004).

  • Irreducibility. The chain is irreducible if it does not “get stuck” in a local region of the sample space. In the discrete case the chain is irreducible if all states are connected.

  • Aperiodicity. The chain is aperiodic if it does not has any deterministic cycle.

  • Harris recurrent. The chain is (Harris) recurrent if it visits any region of the sample space “sufficiently often”.

Irreducibility

  • The aforementioned properties are easy to formalize in the discrete setting, namely when the values of the Markov chain are Y^{(r)} \in \{1, 2,\dots\}.

  • The first passage time is the first r for which the chain is equal to j, namely: \tau_j = \inf\{r \ge 1 : Y^{(r)} = j\}, where by convention we let \tau_j = \infty if Y^{(r)} \neq j for every r \ge 1.

  • Moreover, let us denote the probability of return to j in a finite number of step, starting from j' \mathbb{P}(\tau_j < \infty \mid y^{(0)} = j').

  • Hence, the chain is irreducible if \mathbb{P}(\tau_j < \infty \mid y^{(0)} = j') > 0 for all j, j' \in \mathbb{N}.